If you've ever tried to link non-position-independent code into a
shared library on x86-64, you will have seen a fairly cryptic error
about invalid relocations and missing symbols. Hopefully this will
clear it up a little!
Let's start with a small program to illustrate.
$ cat function.c
int global = 100;
int function(int i)
{
    return i + global;
}
$ gcc -c function.c
First, inspect the disassembly of this function:
0000000000000000 <function>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 7d fc mov %edi,-0x4(%rbp)
7: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # d <function+0xd>
d: 03 45 fc add -0x4(%rbp),%eax
10: c9 leaveq
11: c3 retq
Let's just go through that for clarity:
- 0,1: save rbp to the stack and save the
stack pointer (rsp) to rbp. This common stanza is
setting up the frame pointer, which is essentially a marker used
by debuggers (mostly) to keep track of the base of the current stack
frame. It's not important for now.
- 4: Move the value from edi to 4 bytes below the
stack pointer. This is moving the first argument (int i)
into the "red-zone", a 128-byte scratch area each function has
reserved below the stack pointer.
- 7,d: Move the value at offset 0 from the current
instruction pointer (rip) into eax (i.e. the return
value). Then add the incoming argument to it (retrieved from the
scratch area).
The IP-relative move is really the trick here. We know from the
code that it has to move the value of the
global variable
here. The zero value is simply a placeholder: at this point the
compiler does not know the address (i.e. how far away from the
instruction pointer the memory holding the
global variable
is). It leaves behind a relocation, a note that says to the
linker "you should determine the correct address of global, and
then patch this part of the code to point to that address
(i.e. global)."
The image above gives some idea of how it works. We can examine
relocations with the
readelf tool.
$ readelf --relocs ./function.o
Relocation section '.rela.text' at offset 0x518 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
000000000009 000800000002 R_X86_64_PC32 0000000000000000 global + fffffffffffffffc
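The Info column packs two fields into one 64-bit number. As a minimal sketch (the helper names here are made up for illustration and are not part of any library), the value above can be split by hand: on ELF64 the upper 32 bits are the symbol table index and the lower 32 bits are the relocation type, where type 2 is R_X86_64_PC32 in the x86-64 ABI.

```c
#include <stdint.h>

/* Upper 32 bits of an ELF64 r_info value: index into the symbol table. */
uint32_t rela64_sym(uint64_t info)
{
    return (uint32_t)(info >> 32);
}

/* Lower 32 bits of an ELF64 r_info value: the relocation type. */
uint32_t rela64_type(uint64_t info)
{
    return (uint32_t)(info & 0xffffffffu);
}
```

Passing 0x000800000002 to these helpers yields symbol index 8 and type 2, which is what readelf prints as global and R_X86_64_PC32.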
If you try and build a shared object (dynamic library) with this
function, you should get something like
$ gcc -shared function.c
/usr/bin/ld: /tmp/ccQ2ttcT.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
/tmp/ccQ2ttcT.o: could not read symbols: Bad value
collect2: ld returned 1 exit status
Position Independent Code (PIC, enabled with
-fPIC)
just means that the output binary does not expect to be loaded at a
particular base address, but is happy being put anywhere in memory
(compare the output of
readelf --segments on a binary such as
/bin/ls to that of any shared library). This is obviously
critical for implementing shared libraries, where you may have many,
many libraries loaded in essentially any order, and trying to
pre-allocate where in memory they would all live just does not work.
However, PIC has slightly wider implications than simply movable base
addresses.
So, back to relocations. The exact rules for different relocation
types are described in the ABI for the architecture. The
R_X86_64_PC32 relocation is defined there as S + A - P: the final
address of the symbol (the base of the section the symbol is within,
plus the symbol value), plus the addend, minus the address of the
field being patched. The addend makes it look more tricky than it is;
remember that when an instruction is executing, the instruction
pointer points to the next instruction to be executed. The
4-byte field being patched sits at the very end of the mov
instruction, so the next instruction starts 4 bytes past it.
Therefore, to correctly find the data relative to the instruction
pointer, we need to subtract those extra 4 bytes, hence the addend of
-4 (fffffffffffffffc). This can be seen more clearly when laid out in
a linear fashion (as in the bottom of the above diagram).
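To make the arithmetic concrete, here is a small sketch in C. The function names are hypothetical and any addresses you pass in are made up; the only things taken from the ABI are the S + A - P formula and the fact that the instruction pointer has already moved past the 4-byte field when the mov executes.

```c
#include <stdint.h>

/*
 * Value the linker stores for an R_X86_64_PC32 relocation: S + A - P.
 *   S: final address of the symbol
 *   A: addend recorded in the relocation (-4 in the readelf output)
 *   P: final address of the 4-byte field being patched
 */
int32_t pc32_field(uint64_t S, int64_t A, uint64_t P)
{
    return (int32_t)((int64_t)S + A - (int64_t)P);
}

/*
 * Address the CPU computes at run time. The field is the last 4 bytes
 * of the instruction, so the instruction pointer equals P + 4.
 */
uint64_t rip_relative_target(uint64_t P, int32_t field)
{
    uint64_t next_instruction = P + 4;
    return next_instruction + (uint64_t)(int64_t)field;
}
```

For example, if the field ends up at 0x1009 and global at 0x3010 (both invented), the linker would store 0x2003, and the CPU would compute 0x100d + 0x2003 = 0x3010 at run time, which is the address of global again.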
What's the problem with this in a shared library? In a shared
library situation, we cannot depend on the local value of
global actually being the one we want. Consider the
following example, where we override the value of global with an
LD_PRELOAD library.
$ cat function.c
int global = 100;
int function(int i)
{
    return i + global;
}
$ gcc -fPIC -shared -o libfunction.so function.c
$ cat preload.c
int global = 200;
$ gcc -shared preload.c -o libpreload.so
$ cat program.c
#include <stdio.h>
int function(int i);
int main(void)
{
    printf("%d\n", function(10));
}
$ gcc program.c -L. -lfunction -o program
$ LD_LIBRARY_PATH=. ./program
110
$ LD_PRELOAD=libpreload.so LD_LIBRARY_PATH=. ./program
210
If the code in
libfunction.so has a fixed offset into its
own data section, it will not be able to see the overridden value
provided by
libpreload.so. This is not the case when
building a stand-alone executable, where references are satisfied
internally.
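The reason the preloaded value wins is that the dynamic linker binds a name to the first definition it finds, and preloaded libraries are searched ahead of the program's other shared libraries. The following is only a toy model of that rule; the structures and the lookup function are hypothetical and far simpler than what a real dynamic linker does.

```c
#include <stddef.h>
#include <string.h>

/* One exported symbol of a loaded object (hypothetical, simplified). */
struct symbol {
    const char *name;
    int *addr;
};

/* A loaded object and the symbols it exports. */
struct object {
    const char *name;
    const struct symbol *syms;
    size_t nsyms;
};

/*
 * Search the objects in load order and return the first definition
 * of `name`, or NULL if no object defines it.
 */
int *lookup(const struct object *objs, size_t nobjs, const char *name)
{
    for (size_t i = 0; i < nobjs; i++)
        for (size_t j = 0; j < objs[i].nsyms; j++)
            if (strcmp(objs[i].syms[j].name, name) == 0)
                return objs[i].syms[j].addr;
    return NULL;
}
```

With libpreload.so listed before libfunction.so, a lookup of global returns the preloaded definition; without it, the search falls through to the library's own copy.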
Of course, any problem in computer science can be solved with a
layer of abstraction, and that is what is done when compiling with
-fPIC. To examine this case, let's see what happens with PIC
turned on.
$ gcc -fPIC -shared -c function.c
$ objdump --disassemble ./function.o
./function.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <function>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 89 7d fc mov %edi,-0x4(%rbp)
7: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # e <function+0xe>
e: 8b 00 mov (%rax),%eax
10: 03 45 fc add -0x4(%rbp),%eax
13: c9 leaveq
14: c3 retq
It's almost the same! We set up the frame pointer with the
first two instructions as before. We push the first argument into
memory in the pre-allocated "red-zone" as before. Then, however, we
do an IP-relative load of an address into
rax. Next we
dereference this into
eax (i.e.
eax = *rax in C)
before adding the incoming argument to it and returning.
$ readelf --relocs ./function.o
Relocation section '.rela.text' at offset 0x550 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000000a 000800000009 R_X86_64_GOTPCREL 0000000000000000 global + fffffffffffffffc
The magic here is again in the relocations. Notice this time we
have an
R_X86_64_GOTPCREL relocation. This says "replace the
data at offset
0xa with the IP-relative offset of the global offset table (GOT)
entry of
global".
As shown above, the GOT ensures the abstraction required so symbols
can be diverted as expected. Each entry is essentially a pointer to
the real data (hence the extra dereference in the code above). Since
the GOT is at a fixed offset from the program code, it can use an IP
relative address to gain access to the table entries.
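The same indirection can be modelled in plain C. This is only an analogy under stated assumptions: got_global stands in for the GOT slot, and the two *_global variables stand in for the definitions in libfunction.so and libpreload.so. None of these names exist in the real libraries.

```c
/* Definitions standing in for the two libraries (hypothetical names). */
int libfunction_global = 100;
int libpreload_global = 200;

/* Stand-in for the GOT slot of `global`; the dynamic linker fills it in. */
int *got_global = &libfunction_global;

/* Mirrors the PIC code: load a pointer from the GOT, then dereference it. */
int function(int i)
{
    int *p = got_global;   /* like the IP-relative load into rax */
    return i + *p;         /* like mov (%rax),%eax followed by the add */
}
```

Calling function(10) gives 110 while the slot points at the library's own copy. Once the slot is repointed at libpreload_global, which is roughly what the dynamic linker does when the preload library is present, the same call gives 210 without the code of function changing at all.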
This extra reference is obviously slower; however for the most part
I imagine the overhead would be essentially immeasurable and is
required for "generic" operation. If you have determined that the
cost of indirection through the GOT is the major bottleneck of your
program, I imagine you wouldn't be reading this and would already be
considering strategies to remove it!
The next question is why this works on plain old x86-32.
Inspecting the code reveals why:
$ objdump --disassemble ./function.o
00000000 <function>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: a1 00 00 00 00 mov 0x0,%eax
8: 03 45 08 add 0x8(%ebp),%eax
b: 5d pop %ebp
c: c3 ret
$ readelf --relocs ./function.o
Relocation section '.rel.text' at offset 0x2ec contains 1 entries:
Offset Info Type Sym.Value Sym. Name
00000004 00000701 R_386_32 00000000 global
We start out the same, with the first two instructions setting up
the frame pointer. However, next we load a memory value into
eax; as we can see from the relocation information, the address it
is loaded from is the address of
global. Next we add the incoming argument from
the stack (
0x8(%ebp)) to the value loaded from this memory location,
which has been implicitly dereferenced. This provides the abstraction
we need: if the relocation makes the patched address at
0x4 the
address of the GOT entry, it will be correctly dereferenced. It is
the inability of the x86-32 architecture to optimise by doing
instruction-pointer relative offsetting which means it always needs to
do slower memory references, which turns out to be just what you want
when you're making a shared library!
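For comparison with the x86-64 case, here is a rough sketch of what applying an R_386_32 relocation looks like. R_386_32 is simply S + A, and because this is a '.rel' section (no addend column in the readelf output) the addend is whatever already sits in the field being patched, here zero. The helper names and any address passed in are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from a byte buffer. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit little-endian value into a byte buffer. */
void write_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/*
 * Apply an R_386_32 relocation (S + A). With a ".rel" section the
 * addend A is implicit: it is the value currently stored in the field.
 */
void apply_r386_32(uint8_t *code, size_t offset, uint32_t S)
{
    uint32_t A = read_le32(code + offset);
    write_le32(code + offset, S + A);
}
```

Applied at offset 4 of the function's bytes, this overwrites the four zero bytes after the a1 opcode with the absolute address the linker chose, with no instruction-pointer arithmetic involved.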
So, the executive summary: the ability of x86-64 to use
instruction-pointer relative offsetting to data addresses is a nice
optimisation, but in a shared-library situation assumptions about the
relative location of data are invalid and cannot be relied upon. In
this case, access to global data (i.e. anything that might be changed
around on you) must go through a layer of abstraction, namely the
global offset table.